It is often necessary to import data from a variety of sources and to write the results back out. This file demonstrates a few ways of reading and writing files.
By the end of this file you should have seen simple examples of reading and writing plain text, CSV, binary, MATLAB .mat, and HDF5 files.
Further reading:
http://docs.h5py.org/en/latest/index.html
In [1]:
# Python Imports:
import numpy as np
import scipy.io as sio
%cd datafiles
!ls
In [2]:
kb_contents = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.'
print(kb_contents)
In [3]:
# Read line by line:
file_obj = open('01-simpletext.txt','r')
for line in file_obj:
    print(line)
file_obj.close()
In [4]:
# Use the read method:
file_obj = open('01-simpletext.txt','r')
file_contents = file_obj.read()
file_obj.close()
print(file_contents)
In [5]:
# Python 'with' statement automatically takes care of the close for us:
with open('01-simpletext.txt','r') as file_obj:
    print(file_obj.read())
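As a small check of the automatic close, the sketch below writes a throwaway file and inspects the file object's closed attribute inside and after the with block. The temporary directory and the file name demo.txt are assumptions made for illustration, not part of the data files used here.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')  # hypothetical file name
    with open(path, 'w') as file_obj:
        file_obj.write('Lorem ipsum')
        closed_inside = file_obj.closed  # still open inside the block
    closed_after = file_obj.closed       # closed once the block exits

print(closed_inside, closed_after)
```

The same holds if an exception is raised inside the block, which is the main reason to prefer with over a manual close.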
In [6]:
# Write to ascii files:
file_obj = open('01-simpletext_write.txt','w')
file_obj.write(file_contents)
file_obj.close()
# Or, alternatively:
with open('01-simpletext_write.txt','w') as file_obj:
    file_obj.write(file_contents)
# Check that our written output is good:
with open('01-simpletext_write.txt','r') as file_obj:
    print(file_obj.read())
In [7]:
# Creating a python list:
with open('01-simpledata.csv','r') as file_obj:
    file_contents = file_obj.read().split(',')
print(file_contents)
In [8]:
# Use numpy to read an array from a file
file_contents = np.loadtxt(open('01-simpledata.csv'), delimiter=",")
file_contents = file_contents.astype('float')
print(file_contents)
In [9]:
# Save output of numpy array to csv file
file_contents_write = file_contents*2 #Double to differentiate read vs write data
np.savetxt('01-simpledata_write.csv',file_contents_write, '%0.3f', delimiter=",")
# %0.3f specifies fixed-point notation with 3 decimal places
file_contents = np.loadtxt(open('01-simpledata_write.csv'), delimiter=",")
print(file_contents)
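To make the role of the format string concrete, the sketch below writes the same two values with '%0.3f' (fixed-point) and with '%.3e' (scientific notation). The values are illustrative, and an in-memory io.StringIO buffer stands in for a file so that nothing is written to disk.

```python
import io

import numpy as np

values = np.array([[1.5, 1234.5678]])  # illustrative values

# Fixed-point notation, 3 decimal places
fixed_buf = io.StringIO()
np.savetxt(fixed_buf, values, fmt='%0.3f', delimiter=',')
fixed_text = fixed_buf.getvalue()

# Scientific notation, 3 decimal places in the mantissa
sci_buf = io.StringIO()
np.savetxt(sci_buf, values, fmt='%.3e', delimiter=',')
sci_text = sci_buf.getvalue()

print(fixed_text)  # 1.500,1234.568
print(sci_text)    # 1.500e+00,1.235e+03
```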
Binary files store the same information as text or CSV files, but write the values directly as bytes rather than encoding them as ASCII characters. They are typically faster to read and smaller on disk, but cannot be read in a typical text editor (Notepad, Vim, Sublime, etc.).
Note: be careful to avoid numpy.fromfile and numpy.tofile, as they are not platform independent!
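A rough way to see the size difference is to write one array both as text and in NumPy's .npy binary format and compare the file sizes. This is a minimal sketch: the array, the temporary directory, and the file names demo.csv and demo.npy are assumptions for illustration. Unlike numpy.tofile, np.save records the dtype and shape in the file header, so np.load can restore the array without extra information.

```python
import os
import tempfile

import numpy as np

data = np.linspace(0.0, 1.0, 1000).reshape(100, 10)  # illustrative array

with tempfile.TemporaryDirectory() as tmp_dir:
    txt_path = os.path.join(tmp_dir, 'demo.csv')  # hypothetical file names
    npy_path = os.path.join(tmp_dir, 'demo.npy')

    np.savetxt(txt_path, data, delimiter=',')  # text, default '%.18e' format
    np.save(npy_path, data)                    # binary .npy format

    txt_size = os.path.getsize(txt_path)
    npy_size = os.path.getsize(npy_path)
    loaded = np.load(npy_path)                 # dtype and shape come from the header

print('text bytes:', txt_size, 'binary bytes:', npy_size)
print(loaded.shape, loaded.dtype)
```

The exact ratio depends on the text format chosen, so treat the printed sizes as indicative only.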
In [10]:
# Read in the csv from the previous step:
file_contents = np.loadtxt(open('01-simpledata_write.csv'), delimiter=",")
print(file_contents)
In [11]:
# Save as a binary file (np.savetxt would write text, so use np.save instead):
np.save('01-simpledata_write.npy', file_contents_write*2) # NumPy's .npy format needs no delimiter
file_contents = np.load('01-simpledata_write.npy')
# The following is not recommended, as it is platform dependent:
#np.ndarray.tofile(file_contents_write, '01-simpledata_write.bin')
#file_contents = np.fromfile('01-simpledata_write.bin')
print(file_contents)
In [12]:
# Use scipy to read in .mat files:
mat_contents = sio.loadmat('01-simplemat.mat')
testvar = mat_contents['testvar']
print(testvar)
In [13]:
# Use scipy to write .mat files:
testvar_write = testvar*2 # Double to make read data different from write data
sio.savemat('01-simplemat_write.mat', {'testvar_write': testvar_write})
mat_contents = sio.loadmat('01-simplemat_write.mat')
testvar = mat_contents['testvar_write']
print(testvar) # Print the values read back from the file
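HDF5 (Hierarchical Data Format, version 5) is a file format that offers much greater flexibility at the cost of a bit more complexity. HDF5 is ideal when there would otherwise have been many small files. There are two main objects: datasets, which are array-like collections of data, and groups, which are folder-like containers that hold datasets and other groups.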
In [14]:
import h5py
In [15]:
# Load csv data:
data_csv = np.loadtxt(open('01-simpledata_write.csv'), delimiter=",")
# Load mat data:
data_mat = sio.loadmat('01-simplemat_write.mat')['testvar_write']
# Load text data:
with open('01-simpletext.txt','r') as file_obj:
data_txt = file_obj.read()
In [16]:
# Create a h5py file object:
with h5py.File("01-data_write.hdf5", "w") as file_obj:
    # Use file_obj to create data sets
    # Create a dataset object and assign the values from data:
    dataset1 = file_obj.create_dataset("data", data = data_csv)
Check that the data has been written to the file by opening it:
In [17]:
with h5py.File("01-data_write.hdf5", 'r') as file_obj:
    print(file_obj["data"].name)
    print(file_obj["data"][()]) # [()] reads the whole dataset; .value was removed in h5py 3.0
The "Hierarchical" part of the HDF5 file format provides groups, which act like Python dictionaries or 'folders' for the various Datasets.
In [18]:
# Reopen the same file. Note that "w" mode truncates it, so the earlier "data" dataset is discarded:
with h5py.File("01-data_write.hdf5", "w") as file_obj:
    # Create a group object, and create datasets underneath it:
    grp_nums = file_obj.create_group("Numbers")
    dataset_csv = grp_nums.create_dataset("CSV", data=data_csv)
    dataset_mat = grp_nums.create_dataset("MAT", data=data_mat)
    # Create a second group object, and create datasets underneath it:
    grp_txt = file_obj.create_group("Text")
    txt_hf5 = np.asarray(data_txt, dtype="S") # Convert to NumPy S (byte string) dtype
    dataset_txt = grp_txt.create_dataset("lorem", data=txt_hf5)
After saving this data, check the file structure:
In [19]:
def print_attrs(name, obj): # Function that prints the name and object
    print(name)
    print(obj)
with h5py.File("01-data_write.hdf5", 'r') as file_obj:
    file_obj.visititems(print_attrs) # Use .visititems to get info
In [20]:
with h5py.File("01-data_write.hdf5", 'r') as file_obj:
    # [()] reads the full contents of each dataset (.value was removed in h5py 3.0)
    print(file_obj["/Numbers/CSV"].name)
    print(file_obj["/Numbers/CSV"][()])
    print(file_obj["/Numbers/MAT"].name)
    print(file_obj["/Numbers/MAT"][()])
    print(file_obj["/Text/lorem"].name)
    print(file_obj["/Text/lorem"][()])
For convenience, it's possible to print all of the information using .visititems:
In [21]:
def print_attrs(name, obj):
    print(name)
    if isinstance(obj, h5py.Group):
        print(obj)
    if isinstance(obj, h5py.Dataset):
        print(obj[()]) # Read the dataset contents (.value was removed in h5py 3.0)
with h5py.File("01-data_write.hdf5", 'r') as file_obj:
    file_obj.visititems(print_attrs)
h5py also allows storing of metadata relating to data - check the h5py documentation for more info: http://docs.h5py.org/en/latest/index.html
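As a minimal sketch of such metadata, h5py exposes a dictionary-like .attrs interface on datasets and groups. The file name demo_attrs.hdf5, the dataset contents, and the attribute names units and sample_rate_hz below are hypothetical, chosen only for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_attrs.hdf5')  # hypothetical file name

    # Write a dataset and attach two pieces of metadata to it
    with h5py.File(path, 'w') as file_obj:
        dataset = file_obj.create_dataset('data', data=np.arange(5.0))
        dataset.attrs['units'] = 'volts'          # hypothetical attribute
        dataset.attrs['sample_rate_hz'] = 100.0   # hypothetical attribute

    # Read the metadata back as an ordinary dictionary
    with h5py.File(path, 'r') as file_obj:
        attrs = dict(file_obj['data'].attrs)

print(attrs)
```

Attributes are intended for small items such as units or acquisition settings; larger arrays belong in datasets of their own.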